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PATHWISE COORDINATE OPTIMIZATION 

By Jerome Friedman, 1 Trevor Hastie, 2 Holger Hofling 3 
and Robert Tibshirani 4 

Stanford University 

We consider "one-at-a-time" coordinate-wise descent algorithms 
for a class of convex optimization problems. An algorithm of this kind 
has been proposed for the Li-penalized regression (lasso) in the liter- 
ature, but it seems to have been largely ignored. Indeed, it seems that 
coordinate- wise algorithms are not often used in convex optimization. 
We show that this algorithm is very competitive with the well-known 
LARS (or homotopy) procedure in large lasso problems, and that it 
can be applied to related methods such as the garotte and elastic net. 
It turns out that coordinate- wise descent does not work in the "fused 
lasso," however, so we derive a generalized algorithm that yields the 
solution in much less time that a standard convex optimizer. Finally, 
we generalize the procedure to the two-dimensional fused lasso, and 
demonstrate its performance on some image smoothing problems. 

1. Introduction. In this paper we consider statistical models that lead 
to convex optimization problems with inequality constraints. Typically, the 
optimization for these problems is carried out using a standard quadratic 
programming algorithm. The purpose of this paper is to explore "one-at-a- 
time" coordinate- wise descent algorithms for these problems. The equivalent 
of a coordinate descent algorithm has been proposed for the Li-penalized 
regression (lasso) in the literature, but it is not commonly used. Moreover, 
coordinate-wise algorithms seem too simple, and they are not often used in 
convex optimization, perhaps because they only work in specialized prob- 
lems. We ourselves never appreciated the value of coordinate descent meth- 
ods for convex statistical problems before working on this paper. 

In this paper we show that coordinate descent is very competitive with 
the well-known LARS (or homotopy) procedure in large lasso problems, can 
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deliver a path of solutions efficiently, and can be applied to many other 
convex statistical problems such as the garotte and elastic net. We then 
go on to explore a nonseparable problem in which coordinate-wise descent 
does not work — the "fused lasso." We derive a generalized algorithm that 
yields the solution in much less time that a standard convex optimizer. 
Finally, we generalize the procedure to the two-dimensional fused lasso, and 
demonstrate its performance on some image smoothing problems. 

A key point here: coordinate descent works so well in the class of problems 
that we consider because each coordinate minimization can be done quickly, 
and the relevant equations can be updated as we cycle through the variables. 
Furthermore, often the minimizers for many of the parameters don't change 
as we cycle through the variables, and hence, the iterations are very fast. 

Consider, for example, the lasso for regularized regression [Tibshirani 
(1996)]. We have predictors Xij,j = 1,2,..., p, and outcome values for 
the ith observation, for i = 1,2, . . . ,n. Assume that the standardized 
so that J2i Xij/n = 0, J2i x % = 1- The lasso solves 



n / p x 

~ E Xi i@i 

P i=l \ 7=1 



min 1 



(1) 

subject to ^ \(5j \ < s. 

3=1 

The bound s is a user-specified parameter, often chosen by a model selection 
procedure such as cross-validation. Equivalently, the solution to (1) also 
minimizes the "Lagrange" version of the problem 

n / p \ 2 p 

(2) m = YL U-E^i +7EI/H 

i=l \ j=l ) j=l 

where 7 > 0. There is a one-to-one correspondence between 7 and the bound 
s — if /3(7) minimizes (2), then it also solves (1) with s = J2^=i I ^'(7) I- I n the 
signal processing literature, the lasso and L\ penalization is known as "basis 
pursuit" [Chen et al. (1998)]. 

There are efficient algorithms for solving this problem for all values of s 
or 7; see Efron et al. (2004), and the homotopy algorithm of [Osborne et al. 
(2000)]. There is another, simpler algorithm for solving this problem for a 
fixed value 7. It relies on solving a sequence of single-parameter problems, 
which are assumed to be simple to solve. 

With a single predictor, the lasso solution is very simple, and is a soft- 
thresholded version [Donoho and Johnstone (1995)] of the least squares es- 
timate (3: 

(3) /3 lasso ( 7 ) = S0, 7 ) = sign(/3)(|/3| - 7 )+ 



PATHWISE COORDINATE OPTIMIZATION 



3 




Fig. 1. The lasso problem with a single standardized predictor leads to soft thresholding. 
In each case the solid vertical line indicates the lasso estimate, and the broken line the 
least-squares estimate. 



f/3-7, if P>0 and 7 <|/3|, 

(4) = \p + l, if /3<0 and 7 < 1^1, 

U if 7 > \$\. 

This simple expression arises because the convex-optimization problem (2) 
reduces to a few special cases when there is a single predictor. Minimizing the 
criterion (2) with a single standardized x and (3 simplifies to the equivalent 
problem 

(5) mini(/3-/3) 2 + 7 |/?|, 

where (3 = J2i x iVi is the simple least-squares coefficient. If [3 > 0, we can 
differentiate (5) to get 

(6) |_,_j +7 _ a 

This leads to the solution (3 = (3 — 7 (left panel of Figure 1) as long as (3 > 
and 7 < (3, otherwise is the minimizing solution (right panel). Similarly, if 
(3 < 0, if 7 < — (3, then the solution is ft = f3 + 7, else 0. 

With multiple predictors that are uncorrelated, it is easily seen that once 
again the lasso solutions are soft-thresholded versions of the individual least 
squares estimates. This is not the case for general (correlated) predictors. 
Consider instead a simple iterative algorithm that applies soft-thresholding 
with a "partial residual" as a response variable. We write (2) as 

(7) f0) = hitU-T, x tePk-x ij (3 3 ) +7£l&l+7l#l, 

i=i\ k^j / k^j 
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Fig. 2. Diabetes data: iterates for each coefficient from algorithm (9). The algorithm 
converges to the lasso estimates shown on the right side of the plot. 



where all the values of (5k for k ^ j are held fixed at values fit (7) • Minimizing 
w.r.t. f3j, we get 

(8) kW<-s(£*ij(Vi-vhA> 

where =J2k^j x ikPk(l)- This is simply the univariate regression coef- 
ficient of the partial residual — y£' on the (unit L2-norm) jth. variable; 
hence, this has the same form as the univariate version (3) above. The up- 
date (7) is repeated for j = 1, 2, . . . ,p, 1,2,... until convergence. 
An equivalent form of this update is 

(9) ^^S^ji^ + ^Xijiyi-yi),^ j = l,2,...p,l,2,.... 

Starting with any values for Pj, for example, the univariate regression 
coefficients, it can be shown that the $j( , y) values converge to /r asso . 

Figure 2 shows an example, using the diabetes data from Efron et al. 

(2004) . This data has 442 observations and 10 predictors. We applied al- 
gorithm (9) with 7 = 88. It produces the iterates shown in the figure and 
converged after 14 steps to the lasso solution /r asso (88). 

This approach provides a simple but fast algorithm for solving the lasso, 
especially useful for large p. It was proposed in the "shooting" procedure 
of Fu (1998) and re-discovered by [Daubechies, Defrise and De Mol (2004)]. 
Application of the same idea to the elastic net procedure [Zhou and Hastie 

(2005) ] was proposed by [Van der Kooij (2007)]. 
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2. Pathwise coordinatewise optimization algorithms. The procedure de- 
scribed in Section 1 is a successive coordinate-wise descent algorithm for 
minimizing the function f(f3) = ~ J2i(Vi ~ J2j x ijPj) 2 + ^Dj=i \0j\- The idea 
is to apply a coordinate-wise descent procedure for each value of the reg- 
ularization parameter, varying the regularization parameter along a path. 
Each solution is used as a warm start for the next problem. 

This approach is attractive whenever the single-parameter problem is easy 
to solve. Some of these algorithms have already been proposed in the liter- 
ature, and we give appropriate references. Here are some other examples: 

• The nonnegative garotte. This method, a precursor to the lasso, was pro- 
posed by Breiman (1995) and solves 

n / p \ 2 » 

(10) 



mm 



5 E ( ^* _ E Xi J c i& + ^E c i subject to 



Cj > 0. 



Here 0j are the usual least squares estimates (we assume that p < n). 
Using partial residuals as in (7), one can show that the the coordinate- 
wise update has the form 



(11) 



where 0j = Ya=i x ij(Vi ~ v\ 3) )i and Vi" = J2k^j %ikCkPk- 
Least absolute deviation regression and LAD-lasso. Here the problem is 



(12) 



mm 



i=l 



Vi-Po 



We can write this as 



(13) 



El 



(w-Vi W ) 



Xij 



Pi 



holding all but (3j fixed. This quantity is minimized over j3j by a weighted 

median of the values {yi — y^)/xij. Hence, coordinate- wise minimization 
is just a repeated computation of medians. This approach is studied and 
refined by Li and Arce (2004). 

The same approach may be used in the LAD-lasso [Wang et al. (2006)] . 
Here we add an L\ penalty to the least absolute deviation loss: 



(14) 



mm 

P 



E 



p 

E 



p 



i=i 



This can be converted into an LAD problem (12) by augmenting the 
dataset with p artificial observations. In detail, let X denote n x [p + 
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1) model matrix (with the column of Is for the intercept). The extra p 
observations have response yi equal to zero, and predictor matrix equal to 
A - (0 : I p ). Then the above coordinate- wise algorithm for the LAD problem 
can be applied. 

• The elastic net. This method [due to Zhou and Hastie (2005)] adds a 
second constraint J2^=iP] ^ s 2 to the lasso (1). In Lagrange form, we 
solve 

n / n \ 2 p p 

(15) min EU-E x *Pi + A i E I ^ I + A 2 E $A 

' i=l\ 3=1 / 3=1 3=1 

The coordinate-wise update has the form 

(16) $ . 5(Er=i^(^-#),Ai)+ 



1 + A 2 

Thus, we compute the simple least squares coefficient on the partial resid- 
ual, apply soft-thresholding to take care of the lasso penalty, and then 
apply a proportional shrinkage for the ridge penalty. This algorithm was 
suggested by Van der Kooij (2007). 

Grouped lasso [Yuan and Lin (2006)]. This is like the lasso, except vari- 
ables occur in groups (such as dummy variables for multi- level factors). 
Suppose Xj is an N x pj orthonormal matrix that represents the j'th group 
of pj variables, j = 1, . . . , m, and (5j the corresponding coefficient vector. 
The grouped lasso solves 



(17) min 



J-- 



+ E A 



where Xj = X^/pJ. Other choices of Xj > are possible; this one penalizes 
large groups more heavily. Notice that the penalty is a weighted sum of L 2 
norms (not squared); this has the effect of selecting the variables in groups. 
Yuan and Lin (2006) argue for a coordinate-descent algorithm for solving 
this problem, and show through the Karush-Kuhn-Tucker conditions that 
the coordinate updates are given by 

(is) 4-- WW* - a ;)+hIt' 

where Sj = Xj(y - y^), and here = Y,k±j X A- 
• The "Berhu" penalty. This method is due to Owen (2006), and is another 
compromise between an L\ and L 2 penalty. In Lagrange form, we solve 
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(19) 

p r ii3 2 + 5 2 ) 

This penalty is the reverse of a "Huber" function — initially absolute value, 
but then blending into quadratic beyond 5 from zero. The coordinate-wise 
update has the form 



(20) 0j 



n 

E^(^-y°' ) )/( 1 + V5), if m>5. 



1=1 



This is a lasso-style soft-thresholding for values less than 5, and ridge-style 
beyond 5. 

Tseng (1988) [see also Tseng (2001)] has established that coordinate de- 
scent works in problems like the above. He considers minimizing functions 
of the form 

(21) /(&,...,&) = gifa ,...,&)+!> (Pj), 

i=j 

where g(-) is differentiable and convex, and the hj(-) are convex. Here each 
f3j can be a vector, but the different vectors cannot have any overlapping 
members. He shows that coordinate descent converges to the minimizer of /. 
The key to this result is the separability of the penalty function £)?=i hj(Pj), 
a sum of functions of each individual parameter. This result implies that the 
coordinate-wise algorithms for the lasso, the grouped lasso and elastic net, 
etc. converge to their optimal solutions. 

Next we examine coordinate-wise descent in a more complicated problem, 
the "fused lasso" [Tibshirani, Saunders, Rosset, Zhu and Knight (2005)], which 
we represent in Lagrange form: 

n / p \ 2 p p 

(22) f( f3 ) = iJ2(yi-J2x ij (3 j ) +\ l J2\p j \ + \ 2 J2\i3 j -p j _ 1 \. 

i=l\ j=l / j=l j=2 

The first penalty encourages sparsity in the coefficients; the second penalty 
encourages sparsity in their differences; that is, flatness of the coefficient 
profiles (3j as a function of the index set j. Note that f{(5) is strictly convex, 
and hence has a unique minimum. 

The left panel of Figure 3 shows an example of an application of the 
fused lasso, in a special case where the feature matrix {xij} is the iden- 
tity matrix — this is called fused lasso signal approximation, discussed in 
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gwaroe outer genome -order 

Fig. 3. Fused lasso applied to some Glioblastoma Multiforme data. The data are shown 
in the left panel, and the jagged line in the right panel represents the inferred copy number 
j3 from the fused lasso. The horizontal line is for y — 0. 

the next section. The data represents Comparative Genomic Hybridization 
(CGH) measurements from two Glioblastoma Multiforme (GBM) tumors. 
The measurements are "copy numbers" — log-ratios of the number of copies 
of each gene in the tumor versus normal tissue. The data are displayed in the 
left panel, and the red line in the right panel represents the smoothed signal 
j3 from the fused lasso. The regions of the nonzero estimated signal can be 
used to call "gains" and "losses" of genes. Tibshirani and Wang (2007) re- 
port excellent results in the application of the fused lasso, finding that the 
method outperforms other popular techniques for this problem. 

Somewhat surprisingly, coordinate-wise descent does not work for the 
fused lasso. Proposition 2.7.1 of Bertsekas (1999), for example, shows that 
every limit point of successive coordinate-wise minimization of a continu- 
ously differentiable function is a stationary point for the overall minimiza- 
tion, provided that the minimum is uniquely obtained along each coordi- 
nate. However, f{j3) is not continuously differentiable, which means that 
coordinate- wise descent can get stuck. Looking at Tseng's result, the penalty 
function for the fused lasso is not separable, and hence, Tseng's theorem 
cannot be applied in that case. 

Figure 4 illustrates the difficulty. We created a fused lasso problem with 
100 parameters, with the solutions for two of the parameters, (3%% and /?64, 
being equal to about —1. The top panels shows slices of the function f{(3) 
varying /?63 and /?64, with the other parameters set to the global minimizers. 
We see that the coordinate- wise descent algorithm has got stuck in a corner 
of the response surface, and is stationary under single-coordinate moves. 
In order to advance to the minimum, we have to move both /?63 and (3^ 
together. 
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Fig. 4. Failure of coordinate- wise descent in a fused lasso problem with 100 parameters. 
The optimal values for two of the parameters, and (3^4, are both —1.05, as shown 
by the dot in the right panel. The left and middle panels shows slices of the objective 
function f(/3) as a function of /363 and f3%4, with the other parameters set to the global 
minimizers. The coordinate-wise minimizer over both (3e3 and (3^4 (separately) is —0.69, 
rather than —1.05. The right panel shows contours of the two-dimensional surface. The 
coordinate-descent algorithm is stuck at (— 0.69, —0.69,). Despite being strictly convex, the 
surface has corners, in which the coordinate-wise procedure can get stuck. In order to travel 
to the minimum we have to move both /?63 and /?64 together. 

Despite this, it turns out that the coordinate- wise descent procedure can 
be modified to work for the fused lasso, yielding an algorithm that is much 
faster than a general quadratic-program solver for this problem. 

3. The fused lasso signal approximator. Here we consider a variant of 
the fused lasso (22) for approximating one- and higher-dimensional sig- 
nals, which we call the fused-lasso signal approximator (FLSA). For one- 
dimensional signals we solve 

n n n 

(23) mm f (f3) = \ ^{Vi - Pi) 2 + Xi £ |A| + A 2 £ |A - A-i|- 

^ i=l i=\ i=2 

The measurements yi are made along a one-dimensional index i, and there 
is one parameter per observation. Later we consider images as well. In the 
special case of Ai = 0, the fused-lasso signal approximator is equivalent to 
a discrete version of the "total variation denoising" procedure [Rudin et al. 
(1992)] used in signal processing. We make this connection clear in Section 6. 
Thus, the algorithm that we present here also provides a fast implementation 
for total variation denoising. 

Figure 5 illustrates an example of FLSA with 1000 simulated data points, 
and the fit is shown for s± = 269.2, S2 = 10.9. 

We now describe a modified coordinate-wise algorithm for the diagonal 
fused lasso (FLSA) using the Lagrange form (23). The algorithm can also 
be extended to the general fused lasso problem; details are given in the 
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Fig. 5. Fused lasso solution in a constructed example. 



Appendix. However, it is not guaranteed to give the exact solution for that 
problem, as we later make clear. 

Our algorithm, for a fixed value Ax, delivers a sequence of solutions cor- 
responding to an increasing sequence of values for A2. First we make an 
observation that makes it easy to obtain the solution over a grid of Ai and 
A2 values: 

Proposition 1. The solutions to (23) for all (A' x > Ai,A2) can be ob- 
tained by soft-thresholding the solution for (Ax,A2)- 

This is proven in the Appendix for the FLSA, two-dimensional penalty 
fused lasso and even more general penalty functionals. Thus, our overall 
strategy for obtaining the solution over a grid is to solve the problem for 
Ai = over a grid of values of A2, and then use this result to obtain the 
solution for all values of Ai. However, for a single (especially large) value 
of Ax, we demonstrate that it is faster to obtain the solution directly for 
that value of Ax (Table 2). Hence, we present our algorithm for fixed but 
arbitrary values of Ax- 

Two keys for the algorithm are assumptions (AI) and (A2) below, stating 
that for fixed Ax, small increments in the value of A2 can only cause pairs of 
parameters to fuse and they do not become unfused for the larger of A2. This 
allows us to efficiently solve the problem for a path of A2 values, keeping Ax 
fixed. 

The algorithm has three nested cycles: 

Descent cycle: Here we run coordinate-wise descent for each parameter (3j, 

holding all the others fixed. 
Fusion cycle: Here we consider the fusion of a neighboring pair of parameters, 

followed by coordinate- wise descent. 
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Fig. 6. Example of the one-dimensional search in the coordinate descent cycle for a FLSA 
problem with 5 parameters. Shown is the gradient for /?3 , with the other 4 parameters set 
at the global minimizing values. There are discontinuities at $^,pi and zero. We look 
for a zero-crossing in each of the intervals (—00,^4), (/3a, 0), (0,f32), (/?2,oo), and if none is 
found, take the minimum of f(f3) over the set of discontinuities. In this case, the minimum 
is at a discontinuity, with $3 = /?4 ■ 



Smoothing cycle: Here we increase the penalty A2 a small amount, and rerun 
the two previous cycles. 

We now describe each in more detail. 

Descent cycle. Consider the derivative of (23), holding all fit = Pk,k 7^ i 
fixed at their current values: 

^ = -(y l -ft) + Ai-sign(A) 

(24) 

- A 2 • sign(/3 i+ i - fa) + A 2 • sign(/3i - fa-i), 

assuming that fa ^ {Q, fa-\, fa + i} . 

The algorithm for coordinate- wise minimization of f{(3) works as follows. 
The expression in (24) is piecewise linear in fa with breaks at 0,/3j_i and 
Suppose, for example, at a given iteration we have < fa-i < fa+\. We 
check for a zero of (24) in each of the four intervals (—00, 0] , (0, fa-i], {fa-i, fa+i] 
and (/?j+i,oo). Since the function is piecewise linear, there is an explicit so- 
lution, when one exists. If no solution is found, we examine the three active- 
constraint values for fa: 0, fa-\ and and find the one giving the smallest 
value of the objective function f((3). Figure 6 illustrates the procedure on a 
simulated example. 

Other orderings among 0, fa-\ and fa+i are handled similarly, and at the 
endpoints i = 1 and i = p, there are only three intervals to check, rather than 
four. 



Fusion cycle. The descent cycle moves parameters one at a time. In- 
spection of Figure 4 shows that this approach can get stuck. One way to 
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get around this is to consider a potential fusion of parameters, when a move 
of a single (3i fails to improve the loss criterion. This amounts to enforc- 
ing \(3i — = by setting = = 7. With this constraint, we try a 
descent move in 7. Equation (24) now becomes 

-7^— = -(Vi-i - 7) - {Vi - 7) 

(25) + 2Ai • sign( 7 ) - A 2 • sign(/3 i+1 - 7) 

+ A 2 -sign(7-/3i_ 2 ). 
If the optimal value for 7 decreases the criterion, we accept the move setting 

& = 7. 

Notice that the fusion step is equivalent to temporarily collapsing the 
problem to one with p — 1 parameters: 

• we replace the pair yj_i and t/i with the average response y = (i/j-i +yi)/2 
and an observation weight of 2; 

• the pair of parameters and $ are replaced by a single 7, with a 
penalty weight of 2 for the first penalty. 

At the end of the entire process of descent and fusion cycles for a given 
A 2 , we identify adjacent nonzero solution values that are equal and collapse 
the data accordingly, maintaining a weight vector to assign weights to the 
observations averages and the contributions to the first penalty. 

Smoothing cycle. Although the fusion step often leads to a decrease in 
/(/?), it is possible to construct examples where, for a particular value of 
A 2 , no fusion of two neighbors causes a decrease, but a fusion of three or 
more can. Our final strategy is to solve a series of fused lasso problems 
sequentially, fixing Ai, but varying A 2 through a range of values increasing 
in small increments 5 from 0. 

The smoothing cycle is then as follows: 

1. Start with A 2 = 0, hence, with the lasso solution with penalty parameter 
Ai. 

2. Increment A 2 <— A 2 + 5, and run the descent and fusion cycles repeatedly 
until no further changes occur. After convergence of the process for a given 
value A 2 , identify neighboring solution values that are equal and nonzero 
and collapse the problem as described above, updating the weights. 

3. Repeat step 2 until a target value of A 2 is reached (or a target bound s 2 ). 

Our strategy relies on the following assumptions: 

(Al) If the increments 5 are sufficiently small, fusions will occur between 
no more than two neighboring points at a time. 
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(A2) Two parameters that are fused in the solution for (Ai, A2) will be fused 
for all (Ai, A 2 > A2). 

By collapsing the data after each solution, we can achieve long fusions by 
a sequence of pairwise fusions. Note that each of the fused parameters can 
represent more than one parameter in the original problem. For example, if 
j3j has a weight of 3, and a weight of 2, then the merged parameter 
has a weight of 5, and represents 5 neighboring parameters in the original 
problem. 

After m fusions, the problem has the form 

n—m n—m n—m 

(26) C m + mml ^ Wi { yi - fa) 2 + Ai ^ m\fa\ + IA — 

^ i=l i=l i=2 

Initially, m = 0, Wi = 1, and Co = 0. If the (m + l)st fusion is between fa-± 
and Pi, then the following updates occur: 

• y <- (wi-iyi-i + Wiyi) / (wi-i +Wi). 

• W + <— Wi-l + Wi. 

• C m+ i = C m + \[wi-i ■ (yi-i - y) 2 + Wi ■ - y) 2 }. 

• <- y, Wi-i <- io+ . 

• Discard observation z, and reduce all indices greater than i by 1. 

Note that we don't actually need to carry out the update for C m , because 
no parameters are involved. 

Figure 7 shows an example with just 9 data points. We have fixed Ai = 
0.01 and show the solutions for four values of A2. As A2 increases, the number 
of fused parameters increases. 

Assumption (AI) requires that the data have some randomness (i.e., no 
pre-existing flat plateaus exist). Assumption (A2) holds in general. We prove 
that both assumptions hold for the FLSA procedure in the next section. 

Numerical experiments show that (A2) does not always hold for the gen- 
eral fused lasso. Hence, the extension of this algorithm to the general fused 
lasso (detailed in the Appendix) is not guaranteed to yield the exact solution. 
Note that each descent and fusion cycle can only decrease the convex objec- 
tive and, hence, must converge. We terminate this pair of cycles when the 
change in parameter estimates is less than some threshold. The smoothing 
cycle is done over a discrete grid of A2 values. 

4. Optimality conditions. In this section we derive the optimality con- 
ditions for the FLSA problem, and use them to show that our algorithms' 
assumptions (AI) and (A2) are satisfied. 

We consider the Lagrangian form (23) for the fused lasso. The standard 
Karush-Kuhn-Tucker conditions for this problem are fairly complicated, 
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Fig. 7. Small example of the fused lasso. Ai is fixed at 0.01; as A2 increases, the number 
of fused parameters increases. 

since we need to express each parameter in terms of its positive and neg- 
ative parts in order to make the penalty differentiable. A more convenient 
formulation is through the sub-gradient approach [see, e.g., Bertsekas (1999), 
Proposition B.24]. The equations for the subgradient have the form 

- (2/1 - Pi) + A1S1 - X 2 t 2 = 0, 

(27) 

-{Vj - Pj) + XiSj + \ 2 {tj -t j+1 ) = 0, j = 2,...,n, 

with Sj = sign(/3j) if (3j / and Sj E [—1, 1] if (3j = 0. Similarly, tj = sign(/3j — 
fij-i) if 0j 7^ (3j-i and tj £ [—1, 1] if (3j = ftj-x- These n equations are neces- 
sary and sufficient for the solution. We restate assumptions (Al) and (A2) 
more precisely, and then prove they hold. 

Proposition 2. For the fused-lasso signal approximation algorithm de- 
tailed in Section 3: 

(Al') If the sequence yi are in general position — specifically, no two consec- 
utive yj values are equal — and the increments 5 are sufficiently small, 
fusions will occur between no more than two neighboring points at a 
time. 

(A2') Two parameters that are fused in the solution for (Ai, A2) will be fused 
for all (Ai, A' 2 > A2). 



PATHWISE COORDINATE OPTIMIZATION 



15 



Proof. We first prove (A2'). Suppose we have a stretch of nonzero 
solutions Pj-k, Pj-k+i, ■ ■ ■ ,Pj that are equal for some value A2, and j3j-k-\ 
and are not equal to this common value. Then tj_k and tj+i each 

take a value in {—1, 1}; we denote these boundary values by and Tj + \. 
Although the parameters tj-k+i, ■ ■ ■ , tj can vary in [—1,1] as A2 changes 
(while the fused group remains intact), the values depend on only the (j — 
k + l)st through jth equations in the system (27), because the boundary 
values are fixed. Taking pairwise differences, and using the fact that (3j_k = 
j3j-k+\ = ■ ■ ■ = (3j, this subgroup of equations simplifies to 



(28) 



/ 2 





V 



■1 








1 

A2 







\ 




-1 2 J 



( Vj-k+l 

Vj-k+2 - 



-Vj+k \ 
Vj-k+l 



\ 



Vj-l 

Vj- 



- Vj-2 

V3-1 



+ 



/ 



( tj-k+l \ 

tj-k+2 



V u J 






\T j+ iJ 



Write this system symbolically as Mt = j^Ay + T, and let C = M 1 . The ex- 
plicit form for C given in Schlegel (1970) gives Ca = (n — 1+ 1)/ (n+ 1), Cg n = 
l/(n + l). It is easy to check for all three possibilities for T that CT E [—1,1] 
elementwise. We know that t = (-^CAy + CT) £ [—1,1] elementwise as well, 
since t is a solution to (23) at A2- For X' 2 > X2, the elements of the first terms 
shrink, and hence the values t(X' 2 ) remain in [—1,1]. This implies that the 
fused set remains fused as we increase A2 . These equations describe the path 
t(\2) for A2 increasing, and only change when one of the boundary points 
(fused sets) is fused with the current set, and the argument is repeated. This 
proves (A2 ; ). 

We now address (Al'). Suppose the data are in general position (e.g., 
have a random noise component), and we have the lasso-solution j3j for Ai. 
Because of the randomness, no neighboring nonzero parameters (3j will be 
exactly the same. This means for each nonzero value f3j, we can write an 
equation of the form (27) where we know exactly the values for Sj, tj and tj+\ 
(each will be one of the values {—1, +1}). This means that we can calculate 
exactly the path of each such (5j as we increase A2 from zero, until an event 
occurs that changes the Sj, tj. By looking at all pairs, we can identify the 
time of the first fusion of such pairs. The data are then fused together and 
reduced, and the problem is repeated. Fusions occur one-at-a-time in this 
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Table 1 

Run times (CPU seconds) for lasso problems of various sizes n, p and different 
correlation between the features. Methods are the coordinate-wise optimization (Fortran), 
LARS (R and Fortran versions) and lasso2 (C language) — the homotopy procedure of 

Osborne et al. (2000) 



Method Population correlation between features 









n = 100,p = 


= 1000 











0.1 


0.2 


0.5 


0.9 


0.95 


coord-Fort 


0.31 


0.33 


0.40 


0.57 


1.20 


1.45 


LARS-R 


2.18 


2.46 


2.14 


2.45 


2.37 


2.10 


LARb-l'ort 


2.01 


2.09 


2.12 


1.947 


2.50 


2.22 


lasso2-C 


2.42 


2.16 


2.39 
n = 100,p = 


2.18 
= 5000 


2.01 


2.71 







0.1 


0.2 


0.5 


0.9 


0.95 


coord-Fort 


4.66 


4.51 


3.14 


5.77 


4.44 


5.43 


LARS-R 


28.40 


27.34 


24.40 


22.32 


22.16 


22.75 


LARS-Fort would not run 














lasso2 would not run 


















n= 100, p = 


20,000 











0.1 


0.2 


0.5 


0.9 


0.95 


coord-Fort 


7.03 


9.34 


8.83 


10.62 


27.46 


40.37 


LARS-R 


116.26 


122.39 


121.48 104.17 


100.30 


107.29 


LARS-Fort would not run 














lasso2 would not run 






n = 1000, p 


= 100 











0.1 


0.2 


0.5 


0.9 


0.95 


coord-Fort 


0.03 


0.04 


0.04 


0.04 


0.06 


0.08 


LARS-R 


0.42 


0.41 


0.40 


0.40 


0.40 


0.40 


LARS-Fort 


0.30 


0.24 


0.22 


0.23 


0.23 


0.28 


lasso2-C 


0.73 


0.66 


0.69 
n = 5000, p 


0.68 
= 100 


0.69 


0.70 







0.1 


0.2 


0.5 


0.9 


0.95 


coord-Fort 


0.16 


0.15 


0.14 


0.16 


0.15 


0.16 


LARS-R 


1.02 


1.03 


1.02 


1.04 


1.02 


1.03 


LARS-Fort 


1.07 


1.09 


1.10 


1.10 


1.10 


1.08 


lasso2-C 


2.91 


2.90 


3.00 


2.95 


2.95 


2.92 



fashion, at a distinct sequence of values for A2. Hence, for 5 small enough in 
our smoothing step, we can ensure that we encounter these fusions one at a 
time. □ 

5. Comparison of run times. In this section we compare the run times 
of the coordinate-wise algorithm to standard algorithms, for both the lasso 
and diagonal fused lasso (FLSA) problems. All timings were carried out on 
a Intel Xeon 2.80GH processor. 
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5.1. Lasso speed trials. We generated Gaussian data with n observations 
and p predictors, with each pair of predictors Xj,Xji having the same pop- 
ulation correlation p. We tried a number of combinations of n and p, with 
p varying from zero to 0.95. The outcome values were generated by 

p 

(29) Y = J2PjXj + k-Z, 

3=1 

where fa = exp(-2(j - 1) /20), Z ~ N(0, 1) and k is chosen so that the 
signal-to-noise ratio is 3.0. The coefficients are constructed to have alternat- 
ing signs and to be exponentially decreasing. 

Table 1 shows the average CPU timings for the coordinatewise algorithm, 
two versions of the LARS procedure and lasso2, an implementation of the ho- 
motopy algorithm of Osborne et al. (2000). All algorithms are implemented 
as R language functions. The coordinate-wise algorithm does all of its nu- 
merical work in Fortran, while lasso2 does its numerical work in C. LARS-R 
is the "production version" of LARS (written by Efron and Hastie), doing 
much of its work in R, calling Fortran routines for some matrix operations. 
LARS-Fort (due to Ji Zhu) is a version of LARS that does all of its numerical 
work in Fortran. Comparisons between different programs are always tricky: 
in particular, the LARS procedure computes the entire path of solutions, 
while the coordinate-wise procedure and lasso2 solve the problems for a set 
of pre-defined points along the solution path. In the orthogonal case, LARS 
takes min(ro,p) steps: hence, to make things roughly comparable, we called 
the latter two algorithms to solve a total of min(n,p) problems along the 
path. 

Not surprisingly, the coordinate-wise algorithm is fastest when the cor- 
relations are small, and gets slower when they are large. It seems to be 
very competitive with the other two algorithms in general, and offers some 
potential speedup, especially when n> p. 

Figure 8 shows the CPU times for coordinate descent, for the same prob- 
lem as in Table 1. We varied n and p, and averaged the times over five runs. 
We see that the times are roughly linear in n and in p. 

A key to the success of the coordinate-wise algorithm for lasso is the fact 
that, for squared error loss, the ingredients needed for each coordinate step 
can be easily updated as the algorithm proceeds. We can write the second 
term in (9) as 

n 

(30) ^Xij(yi -jji) = (xj,y) - ^ (xj,x k }f3 k , 

where (xj,y) = Y^l=\ x iiVi-, an d so on. Hence, we need to compute inner 
products of each feature with y initially, and then each time a feature x k 
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Fig. 8. CPU times for coordinate descent, for the same problem as in Table 1, for dif- 
ferent values of n and p. In each case the times are averaged over five runs and averaged 
over the set of values of the other parameter (n or p). 



enters the model, we need to compute its inner product with all the rest. 
But importantly, 0{n) calculations do not have to be made at every step. 
This is the case for all penalized procedures with squared error loss. 

Friedlander and Saunders (2007) do a thorough comparison of the LARS 
(homotopy) procedure to a number of interior point QP procedures for the 
lasso problem, and find that LARS is generally much faster. Our finding 
that coordinate descent is very competitive with LARS therefore suggests 
that also will outperform interior point methods. 

Finally, note that there is another approach to solving the FLSA problem 
for Ai = 0. We can transform to parameters Oj = Pj —/3j—i, and we get a new 
lasso problem in these new parameters. One can use coordinate descent to 
solve this lasso problem, and then Proposition 1 gives the FLSA solution for 
other values of Ai. However, this new lasso problem has a dense data matrix, 
and hence, the coordinate descent procedure is many times slower than the 
procedure described in this section. The procedure developed here exploits 
the near-diagonal structure of the problem in the original parametrization. 

5.2. Fused lasso speed trials. For the example of Figure 5, we com- 
pared the pathwise coordinate algorithm to the two-phase active set al- 
gorithm sqopt of Gill, Murray and Saunders (1999). Both algorithms are 
implemented as R functions, but do all but the setup and wrapup compu- 
tations in Fortran. Table 2 shows the timings for the two algorithms for a 
range of values of Ai and A2- The resulting number of active constraints 
(i.e., Pj = or Pj — Pj~\ = 0) is also shown. In the second part of the table, 
we increased the sample size to 5000. We see that the coordinate algorithm 
offers substantial speedups, by factors of 50 up to 300 or more. 

In these tables, each entry for the pathwise coordinate procedure is the 
computation time for the entire path of solutions leading to the given val- 
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ues Ai,A2- In practice, one could obtain all of the solutions for a given Ai 
from a single run of the algorithm, and hence, the numbers in the table 
are very conservative. But we reported the results in this way to make a 
fair comparison with the standard procedure since it can also exploit warm 
starts. 

In Table 3 we show the run times for pathwise coordinate optimization 
for larger values of re. As in the previous table, these are the averages of 
run times for the entire path of solutions for a given Ai, and hence, are 
conservative. We were unable to run the standard algorithm for these cases. 

6. The two-dimensional fused lasso. Suppose we have a total of n 2 cells, 
laid out in a n x n grid (the square grid is not essential). We can generalize 
the diagonal fused lasso (FLSA) to two-dimensions as follows: 

n n 

m i n \ H (#**' ~ ^') 2 

P i=li'=l 

Table 2 

Run times (CPU seconds) for fused lasso (FLSA) problems of various sizes n for different 
values of the regularization parameters A1.A2. The methods compared are the pathwise 
coordinate optimization, and " standard" -two-phase active set algorithm sqopt of 
Gill, Murray and Saunders (1999). The number of active constraints in 
the solution is shown in each case 

Ai A2 # Active Coord Standard 



n = 1000 



0.00 


0.01 


456 


0.040 


2.100 


0.00 


1.00 


934 


0.024 


0.931 


0.00 


2.00 


958 


0.019 


0.987 


1.00 


0.01 


824 


0.022 


1.519 


1.00 


1.00 


975 


0.024 


1.561 


1.00 


2.00 


981 


0.023 


1.404 


2.00 


0.01 


861 


0.023 


1.499 


2.00 


1.00 


983 


0.023 


1.418 


2.00 


2.00 


991 


0.018 


1.407 






n = 5000 






0.000 


0.002 


4063 


0.217 


20.689 


0.000 


0.200 


3787 


0.170 


26.195 


0.000 


0.400 


4121 


0.135 


29.192 


0.200 


0.002 


4305 


0.150 


41.105 


0.200 


0.200 


4449 


0.141 


48.998 


0.200 


0.400 


4701 


0.129 


45.136 


0.400 


0.002 


4301 


0.108 


41.062 


0.400 


0.200 


4540 


0.123 


41.755 


0.400 


0.400 


4722 


0.119 


38.896 
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Table 3 

Run times (CPU seconds) for pathwise coordinate optimization applied 
to fused lasso (FLSA) problems with a large number of parameters n 
averaged over different values of the regularization parameters Ai , A2 



n 


Average CPU sec 


100,000 


3.54 


500,000 


14.93 


1,000,000 


29.81 



n n 

subject to \ - Sl > 

. , i=ii'=i 
(31) 

n n 

i=l i'=2 
n n 

mi l/V -A-i.t'i < s 3- 

i=2i'=l 

The penalties encourage the parameter map to be both sparse and spa- 
tially smooth. 

The fused lasso is related to signal processing methods such as "total vari- 
ation denoising" [Rudin et al. (1992)], which uses a continuous smoothness 
penalty analogous to the second penalty in the fused lasso. The TV criterion 
is written in the form 

(32) min / \Vu\du subject to \\u — y|| 2 = a 2 , 

u Jn 

where y is the data, u is the approximation with allowable error a 2 , Q is a 
bounded convex region in R d , \ ■ \ denotes Euclidean norm in R d and || • || 
denotes the norm on L 2 (f2). Thus, in d = 1 dimension, this is a continuous 
analogue of the fused lasso, but without the (first) L\ penalty on the coef- 
ficients. In d = 2 dimensions, the TV approach is different: the discretized 
version uses the Euclidean norm of the first differences in u, rather than the 
sum of the absolute values of the first differences. 

This problem (32) can be solved by a general purpose quadratic-programming 
algorithm; we give details in the Appendix. However, for a p x p grid, there 
are 7p 2 variables and 3p 2 + 3 constraints, in addition to nonnegativity con- 
straints on the variables. For p = 256, for example, this is 458, 752 variables 
and 196,611 constraints, so that finding the exact solution is impractical. 

Hence, we focus on the pathwise coordinate algorithm. The algorithm 
has the same form as in the one-dimensional case, except that rather than 
checking the three active constraint values 0, and /3j+i, we check and 
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the four values to the right, left, above and below (its four-neighborhood). 
The number of constraint values reduces to 4 at the edges and 3 at the 
corners. The algorithm starts with individual pixels as the groups, and the 
four-neighborhood pixels are its "distance-1 neighbors." In each fusion cycle 
we try to fuse a group with its distance-1 neighbors. If the fusion is ac- 
cepted, then the distance-1 neighbors of the fused group are the union of 
the distance-1 neighbors of the two groups being joined (with the groups 
themselves removed). Now one pixel might be the distance-1 neighbor to 
each of the two groups being fused, and some careful bookkeeping is re- 
quired to keep track of this through appropriate weights. Full details are 
given in the Appendix. 

We do not provide a proof of the correctness of this procedure. However, 
in our (limited) experiments we have found that it gives the exact solution to 
the two-dimensional fused lasso. We guess that a proof along the lines of that 
in the one-dimensional case can be constructed, although some additional 
assumptions may be required. 

As in the one-dimensional FLSA (Proposition 1), if we write the problem 
in terms of Lagrange multipliers (Ai, A2, A3), the solution for (X[ > Ai, A2, A3) 
can be obtained by soft-thresholding the solutions for (Ai,A2,A3). 

6.1. Example 1. Figure 9 shows a toy example. The data are in the 
top left, representing a "+"-shaped image with iV(0, 1) noise added. The 
reconstruction by the lasso and fused lasso are shown in the other panels. 
In each case we did a grid search over the tuning parameters using a kind 
of two- fold validation. We created a training set of the odd pixels (1, 3, 5 . . . 
in each direction) and tested it on the even pixels. For illustration only, we 
chose the values that minimized the squared reconstruction error over the 
test set. We see that the fused lasso has successfully exploited the spatial 
smoothness and provided a much better reconstruction than the lasso. 

Table 4 shows the number of CPU seconds required for the standard and 
pathwise coordinate descent algorithms, as n increases. We were unable to 
apply the standard algorithm for n = 256 (due to memory requirements), 
and have instead estimated the time by crude quadratic extrapolation. 

6.2. Example 2. Figure 10 shows another example. The noiseless image 
(top panel) was randomly generated with 512 x 512 pixels. The background 
pixels are zero, while the signal blocks have constant values randomly chosen 
between 1 and 4. The top right panel shows the reconstruction by the fused 
lasso: as expected, it is perfect. In the bottom left we have added Gaussian 
noise with standard deviation 1.5. The reconstruction by the fused lasso 
is shown in the bottom right panel, using two-fold validation to estimate 
Ai,A2- The reconstruction is still quite good, capturing most of the impor- 
tant features. In this example we did a search over 10 Ai values. The entire 
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Fig. 9. A toy example: the data are in the top left, representing a "+" -shaped image 
with added noise. The reconstructions by the lasso and fused lasso are shown in the other 
panels. In each case we did a grid search over the tuning parameters using a kind of 
two-fold validation. 

Table 4 

2D fused lasso applied to the toy problem. The table shows the number of 
CPU seconds required for the standard and pathwise coordinate descent 
algorithms, as n increases. The regularization parameters were set at 
the values that yielded the solution in the bottom left panel of Figure 9 



n 


Standard 


Coord 


8 


2.0 s 


0.07 s 


16 


3.4 s 


0.13 s 


32 


20.8 s 


0.38 s 


256 


38 min 


7.1 s 
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computation for the bottom right panel of Figure 10, including the two-fold 
validation to estimate the optimal values for Ai and A2, took 11.3 CPU 
minutes. 



6.3. Example 3. The top left panel of Figure 11 shows a 256 x 256 gray 
scale image of statistician Sir Ronald Fisher. In the top right we have added 
Gaussian noise with a standard deviation 2.5. We explore the use of the two- 
dimensional fused lasso for denoising this image. However, the first (lasso) 
penalty doesn't make sense here, as zero does not represent a natural base- 
line. So instead, we tried a pure fusion model, with Ai = 0. We found the 
best value of A2 , in terms of reconstruction error from the original noiseless 
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Fig. 11. Top-left panel: 256 x 256 gray scale image of statistician Sir Ronald Fisher. 
Top-right panel: Gaussian noise with standard deviation 2.5 has been added. Bottom-left 
panel: best solution with A 2 set to zero (pure lasso penalty); this gives no improvement 
in reconstruction error. Bottom-right panel: best solution with Ai set to zero (pure fusion 
penalty). This reduces the reconstruction error from 6.18 to 1.15. 

image. The solution shown in the bottom right panel gives a reasonable ap- 
proximation to the original image and reduces the reconstruction error from 
6.18 to 1.15. In the bottom left panel we have set A2 = 0, and optimized over 
Ai . As expected, this pure lasso solution does poorly, and the optimal value 
of Ai turned out to be 0. 

6.4. Applications to higher dimensions and other problems. The general 
strategy for the two-dimensional fused lasso can be directly applied in higher 
dimensional problems, the difference being that each cell would have more 
potential distance-1 neighbors. In fact, the same strategy might be applicable 
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to non-Euclidean problems. All one needs is a notion of distance-1 neighbors 
and the property that the distance-1 neighbors of a fusion of two groups are 
the union of the distance-1 neighbors of the two groups, less the fused group 
members themselves. 

7. Discussion. Coordinate-wise descent algorithms deserve more atten- 
tion in convex optimization. They are simple and well-suited to large prob- 
lems. We have found that for the lasso, coordinate-wise descent is very 
competitive with the LARS algorithm, probably the fastest procedure for 
that problem to-date. Coordinate-wise descent algorithms can be applied 
to problems in which the constraints decouple, and a generalized version of 
coordinate-wise descent like the one presented here can handle problems in 
which each parameter is involved in only a limited number of constraints. 
This procedure is ideally suited for a special case of the fused lasso — the 
fused lasso signal approximator, and runs many times faster than a stan- 
dard convex optimizer. On the other hand, it is not guaranteed to work for 
the general fused lasso problem, as it can get stuck away from the solution. 

7.1. Software. Both Fortran and R language routines for the lasso, and 
the one- and two-dimensional fused lasso will be made freely available. 

APPENDIX 

A.l. Proof of Proposition 1. Suppose that we are optimizing a function 
of the form 

n n 

/(/3) = ^E(^-^) 2 + A iEi^i+ E hslPi-Pil 

i=i i=i (i,j)ec 

where (i,j) £ C if the difference — f3j\ is L\ penalized with penalty pa- 
rameter Aj j. This general form for the penalty includes the following models 
discussed earlier: 

Fused lasso signal approximator: Here, (i,j) SC if i=j — 1. Furthermore, 
\,i = A2. 

Two-dimensional fused lasso: Here i itself is a two-dimensional coordinate 
i = (h,i 2 ). Let (i,j) £ C if \h - ji| + \i 2 - J2I = 1- If \h - ji\ = 1, then 
\,j = A2, otherwise Ajj = A3. 

Now we prove a soft thresholding result. 

Lemma A.l. Assume that the solution for \\ = and \ 2 > is known 
and denote it by (3(0, X 2 )- Then, the solution for Ai > is 

&(Ai, A 2 ) = sign(A(0, A 2 ))(|ft(0, A 2 )| - Ax) + . 
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Proof. First find the subgradient equations for Pi, . . . ,p n , which are 
9i = -{Vi ~ Pi) + Al Si + ^tj,i=0, 

where Sj = sign(/3j) if Pi ^ and Si G [—1, 1] if Pi = 0. Also, tij = sign(/3j — j3j) 
for Pi / /3j and G [—1, 1] if Pi = Pj. These equations are necessary and 
sufficient for the solution. 

As it is assumed that a solution for Ai = is known, let s(0) and t(0) 
denote the values of the parameters for this solution. Specifically, Sj(0) = 
sign(/3j(0)) for /3j(0) / and for Pi(0) = 0, it can be chosen arbitrarily, so 
set Sj(0) = 0. Note that as A2 is constant throughout the whole proof, the 
dependence of P, s and t on A2 is suppressed for notational convenience. 

In order to find i(Ai), observe that soft thresholding of /3(0) does not 
change the ordering of pairs Pi{\\) and Pj(Xi) for those pairs for which at 
least one of the two is not and, therefore, it is possible to define if ?(Ai) = 
ti t j(0). If Pi(Xi) = Pj(X\) = 0, then tij can be chosen arbitrarily in [—1, 1] 
and, therefore, let Uj(Xi) =ti.j(0). Thus, without violating restrictions on 
tij, t(Xi) = t(0) for all Ai > 0. s(Ai) will be chosen appropriately below so 
that the subgradient equations hold. 

Now insert Pi{X\) = sign(/3j(0))(|/3j(0)| — Ai) + into the subgradient equa- 
tions. For Ai > 0, look at 2 cases: 

Case 1. |#(0)|>Ai 

9i{Xi) = -yi + Pi(0) - AiSi(O) + AiSi(Ai) 

+ X] ^2*ij(Al) — Y, ^,i*j,i(Al) 
3'-(i,j)eC j':(j',i)eC 

= -y* + A(o)+ Yl A 2^'(0)- Y A iAi(°) = ° 

by setting Sj(Ai) = Sj(0), using the definition of t(X\) and noting that /3(0) 
was assumed to be a solution. 

Case 2. \Pi(0)\ < X±. Here, P(Xi) = and, therefore, 

9i(M) = -Vi + AiSj(Ai) + A 2 tjj(Ai) - ^ Aj^^Ai) 

= -Vi + PM+ Y x 2ti,j{0)- Y A i,^,i(0) = 

by choosing Sj(Ai) = /3j(0)/Ai G [—1, 1] and again using that /3(0) is optimal. 
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As the subgradient equations hold for every Ai > 0, soft thresholding gives 
the solution. Note that we have assumed that Xij = A2, but this proof works 
for any fixed values Ajj. 

Using this theorem, it is possible to make a more general statement. 

Proposition A.l. Let /3(Ai,A2) be a minimum of /(/?) for (Ai,A2). 
Then the solution for the parameters (A' 1; A2) with A^ > Ai is a soft thresh- 
olding of (3(\i,\<2,), that is, 

0(X[, A 2 ) = sign(/3(A 1 ,A 2 ))(/3(A 1 , A 2 ) - (\[ - A 2 )) + . 

Proof. As a solution for minimizing f{(3) exists and is unique for all 
Ai, A2 > 0, the solution /3(0, A2) exists and /3(Ai, A 2 ) as well as (3(\'i, A2) are 
soft-thresholded versions of it using the previous theorem. Therefore, 

A(Ai, A 2 ) = sign(A(0, A 2 ))(|A(0, A 2 )| - Ai) + , 
A(A;, A 2 ) = sign(A(0, A 2 ))(|A(0, A 2 )| - Ai)+ 

for i = l,...,n. If /3j(Ai,A 2 ) = 0, then also /3 i (A' 1 ,A 2 ) = 0. For / S i (Ai,A 2 ) > 
0, the soft-thresholding implies that the sign did not change and, thus, 
sign(/3j(Ai, A 2 )) = sign(/3j(0, A 2 )). It then follows 

/3 4 (A , 1 ,A2) = sign(/3 4 (0,A2))(|A(0,A 2 )|-A' 1 ) + 

= sign(A(A!, A 2 ))(|A(Ai, A 2 )| - (X[ - Xi)) + . 

Therefore, /3(A' 1; A 2 ) is a soft-thresholded version of /3(Ai, A 2 ). □ 

Quadratic programming solution for the two-dimensional fused lasso. Here 
we outline the solution to the two-dimensional fused lasso using the general 
purpose sqopt package of Gill, Murray and Saunders (1999). 

Let fa = (3+ - with p+,,0-, > 0. Define 9% = - for i > 

1> ViV = #,<' " /V-l for *' > !. and n = /5n- Let = " e ii with 
®i' + >@i>~ — 0; an d similarly for 0^,. We string out each set of parameters 
into one long vector, starting with the 11 entry in the top left, and going 
across each row. 

Let L\ and Li be the nxn matrices that compute horizontal and vertical 
differences. Let e be a column ?i-vector of ones, and / be the nxn identity 
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matrix. Then the constraints can be written as 



(33) 
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Here oq = (oo, 0, ... 0). Since Pu> = Qui, setting its bounds at ±oo avoids a 
"double" penalty for \Pu'\ and similarly for f3n>. Similarly, eo equals e, with 
the first component set to zero. 

Pathwise coordinate optimization for the general one- dimensional fused 
lasso. This algorithm has exactly the same form as that for the fused lasso 
signal approximator given earlier. The form of the basic equations is all that 
changes. 

Equation (24) becomes 

/ p 



9(3, 



E 

8=1 



(34) 



+ Ai - sign (/?,■; 
+ A 2 • sign(/3j 



- A 2 -sign(/3 i+ i -fa) 
Pj-i), 



assuming that (3j £ {0, Pj-i, Pj+i}- 
Similarly, expression (25) becomes 



g/Og) 

c?7 



n 
i=l 



^7 



(35) 



where z,- 



+ 2Ai • sign(7) - A 2 • sign(/3 i+2 - 7) 

+ A 2 •sign(7-/3 i _ 1 ), 

+ If the optimal value for 7 decreases the criterion, we 

accept the move setting (3j = (3j-i = 7. 

Pathwise coordinate optimization for two-dimensional fused lasso signal 
approximator. Consider a set (grid) G of pixels p = with 1 < i < m, 

1 < i < 7i2- Associated with each pixel is a data value y p = yij . The goal is 
to obtain smoothed values (3 p = 0ij for the pixels that jointly solve 

W*{ h EEfe - ^') 2 + Al £ £ 1 Ail 

i^-f i=lj=l i=lj=l 
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(36) 



ni 712 ni ri2 



+ A 2 E E i/% - A-u I + E E I - 

i=2j=2 i=lj=l 

Defining the distance between two pixels p = andp' = (i',f) as d(p,p') = 
K ~~ + \j ~ (36) can be expressed as 

miniE^P"^) 2 

+Ai£lA>l + (V2) E \0p-M- 

p£G d(p,p')=l 

Consider a partition of G into K contiguous groups {Gk}, G = \J Gk and 
Gk n Gk> = 0, k ^ A;'. A group Gfc is contiguous if any p G 67*,. can be reached 
from any other p' G G/% by a sequence of distance-one steps within Gk . Define 
the distance between two groups Gk and Gk' as 

D(k,k')= mm d(p,p'). 

P&G k 
P 'eG k , 

Suppose one seeks the solution to (37) under the constraints that all 
pixels in the same group have the same parameter value. That is, for each 
Gk, {(3 P = 7fc}peG fe - The corresponding optimal group parameter values jk 
are the solution to the unconstrained problem 

K 

^ k=l 

(38) 

+ Ai E^-It/cI + (A 2 /2) E w kk'\lk -Tfe'l) 

k=l D(k,k')=l 

where Nj. is the number of pixels in Gk, Vk is the mean of the pixel data 
values in Gk, and 

wkk>= E E 7 [ d (P'P') = !]■ 

p&G h p'&Gy 

Note that (38) is equivalent to (37) when all groups contain only one pixel. 

Further, suppose that for a given partition one wishes to obtain the solu- 
tion to (38) with the additional constrain that two adjacent groups Gi and 
Gi>, D(l,l') = 1, have the same parameter value 7 m ; that is, 7/ = 7/' = j m , 
or equivalently, {j3 p = 7 m }peG;UG i / • This can be accomplished by deleting 
groups I and V from the sum in (38) and adding the corresponding "fused" 
group G m = GiU G v , with N m = N t + N v , y m = {N m + N v y v )/N m , 

{Gk'}D(m,k')=l 

(39) 

= {Gk'}D(i,k')=i u {Gk'}D(i>,k')=i - Gi - Gv, 
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and w mk > = w w + w Vk i . 

The strategy for solving (36) is based on (38). As with FLSA (Section 
3), there are three basic operations: descent, fusion and smoothing. For a 
fixed value of Ai, we start at A2 = and n\ ■ ri2 groups each consisting of 
a single pixel. The initial A2 = solution of each j k is obtained by soft- 
thresholding 7/; = S(yk, Ai) as in (3). Starting at this solution, the value of 
A2 is incremented by a small amount A2 <— A2 + 6. Beginning with 71, the 
descent operation solves (38) for each 7^ holding all other {jk'}k'^k at their 
current values. The derivative of the criterion in (38) with respect to jk 
is piecewise linear with breaks at 0, {jk'}D(k,k')=i- The solution for jk is 
thus obtained in the same manner as that for the one-dimensional problem 
described in Section 3. If this solution fails to change the current value of 7^, 
successive provisional fusions of Gk with each Gk' for which D(k,k') = 1 
are considered, and the solution for the corresponding fused parameter j m 
is obtained. The derivative of the criterion in (38) with respect to -y m is 
piecewise linear with breaks at 0, {7fc'}-D(m,fc')=i (39). If any of these fused 
solutions for j m improves the criterion, one provisionally sets 7^ = 7^ = 7 m . 
If not, the current value of 7^ remains unchanged. 

One then applies the descent /fusion strategy to the next parameter, k <— 
k + 1, and so on until a complete pass over all parameters {7^} has been 
made. These passes (cycles) are then repeated until one complete pass fails 
to change any parameter value, at which point the solution for the current 
value of A2 has been reached. 

At this point each current group Gk is permanently fused (merged) with 
all groups Gk' for which 7^ = 7^/, 7^ 7^ 0, and D(k, k') = 1, producing a new 
criterion (38) with potentially fewer groups. The value of A2 is then further 
incremented A2 <— A2 + S and the above process is repeated starting at the 
solution for the previous A2 value. This continues until a specified maximum 
value of A2 has been reached or until only one group remains. 
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